Regression diagnostics

This post covers the least-squares line from Statistics for Engineers and Scientists by William Navidi.

Basic Ideas

  • Interpreting the Slope of the Least-Squares Line

    • If the $x$-values of two points on a line differ by $1$, their $y$-values will differ by an amount equal to the slope of the line.

    • If the values of the explanatory variable for two individuals differ by $1$, their predicted values will differ by $\hat{\beta}_1$.

    • If the values of the explanatory variable differ by an amount $d$, then their predicted values will differ by $\hat{\beta}_1 d$ (see the first sketch after this list).

  • The Estimates Are Not the Same as the True Values

    • It is important to understand the difference between the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ and the true values $\beta_0$ and $\beta_1$.
    • The true values are constants whose values are unknown.
    • The estimates are quantities computed from the data; we may use them as approximations to the true values.
    • Because the data contain random errors, repeating the experiment would produce different data and hence different estimates; therefore $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables.
    • To make full use of these estimates, we will need to be able to compute their standard deviations (the simulation sketch after this list illustrates this).
  • The Residuals Are Not the Same as the Errors

    • The residuals $e_i = y_i - \hat{y}_i$ can be computed from the data, but the errors $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ cannot, because the true coefficients $\beta_0$ and $\beta_1$ are unknown.
    • Since $\hat{\beta}_0$ and $\hat{\beta}_1$ only approximate $\beta_0$ and $\beta_1$, the residuals only approximate the errors (the third sketch after this list compares them).

  • Don’t Extrapolate Outside the Range of the Data

    • For many variables, linear relationships hold within a certain range, but not outside it.
    • Therefore, if we extrapolate a least-squares line outside the range of the data, there is no guarantee that it will properly describe the relationship.
    • For example, to know how a spring will respond to a load of 100 lb, we must include weights of 100 lb or more in the data set.
  • Don’t Use the Least-Squares Line When the Data Aren’t Linear

    • When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.

  • Measuring Goodness-of-Fit

    • A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.

    • The coefficient of determination $r^2$ (the square of the correlation coefficient) is a goodness-of-fit statistic for the linear model.

    • In the book's example, the points on the scatterplot are $(x_i, y_i)$, where $x_i$ is the height of the $i$th man and $y_i$ is the length of his forearm. Then

      \(r^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}}\)

    • The sums of squares appearing in this discussion are used so often that statisticians have given them names. They call $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ the error sum of squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ the total sum of squares. Their difference

      \(\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

      is called the regression sum of squares; the fourth sketch after this list computes all three.
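
First, a minimal numeric sketch of the slope interpretation. The spring-style data here (load in lb, length in inches) are invented for illustration, and the fit uses numpy's `polyfit`:

```python
import numpy as np

# Hypothetical data: x = load (lb), y = spring length (in).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.00, 5.11, 5.20, 5.32, 5.39, 5.51])

# Least-squares line: y_hat = b0 + b1 * x (polyfit returns slope first).
b1, b0 = np.polyfit(x, y, 1)

def predict(x0):
    return b0 + b1 * x0

# Two x-values that differ by d = 2 have predictions that differ by b1 * d.
d = 2.0
print(predict(3.0) - predict(1.0))  # difference of predicted values
print(b1 * d)                       # the same number: slope times d
```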
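
Second, a simulation sketch showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables. The true coefficients and the normal error model are invented for the simulation; in a real experiment they would be unknown:

```python
import numpy as np

rng = np.random.default_rng(0)

# True values, unknown in practice (invented here for the simulation).
beta0, beta1, sigma = 5.0, 0.1, 0.05
x = np.linspace(0.0, 5.0, 20)

# Each repetition of the "experiment" draws new random errors,
# so the slope estimate comes out differently each time.
slopes = [np.polyfit(x, beta0 + beta1 * x + rng.normal(0, sigma, x.size), 1)[0]
          for _ in range(1000)]

print(np.mean(slopes))  # centered near the true slope beta1 = 0.1
print(np.std(slopes))   # spread: the standard deviation of beta1-hat
```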
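
Third, a sketch contrasting residuals with errors under the same invented model. The errors are visible here only because we simulated them; with real data only the residuals would be available:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 5.0, 0.1, 0.05
x = np.linspace(0.0, 5.0, 20)

errors = rng.normal(0.0, sigma, size=x.size)  # true errors: unobservable in practice
y = beta0 + beta1 * x + errors

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)                 # computable from the data alone

# The residuals approximate the errors but are not equal to them,
# because b0 and b1 are only estimates of beta0 and beta1.
print(np.column_stack([errors, residuals])[:5])
```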
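
Finally, a fourth sketch computing the three sums of squares and $r^2$ on the same made-up data as the first sketch, and checking that the ratio agrees with the squared correlation coefficient:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.00, 5.11, 5.20, 5.32, 5.39, 5.51])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)       # error sum of squares
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
ssr = sst - sse                      # regression sum of squares

print(ssr / sst)                     # r^2 = regression SS / total SS
print(np.corrcoef(x, y)[0, 1] ** 2)  # agrees with the squared correlation r
```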